MentaLiST quick start

This notebook shows how to run MentaLiST to create new MLST scheme databases, either by downloading schemes from public MLST websites or by building them from custom files, and then how to call alleles for NGS samples.


In [1]:
# depending on how you installed mentalist, you might have to add it and julia to the PATH:
PATH=$PATH:/rhome/pfeijao/sfu/MentaLiST/src:/rhome/pfeijao/bin
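Before going further, it can help to confirm that the PATH change took effect. The following is a minimal sketch, not part of MentaLiST: `require_tools` is a hypothetical helper name, and it only checks that each named command can be found on the PATH.

```shell
# require_tools TOOL...
# Hypothetical helper: reports every named command that is not on the PATH
# and returns non-zero if at least one is missing.
require_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found on PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For this notebook it would be called as `require_tools mentalist julia`.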

It might also be a good idea to create a new folder to store the results of the examples below:


In [2]:
mkdir -p /tmp/mentalist_quick_start
cd /tmp/mentalist_quick_start

Help

MentaLiST.jl is the main script, with several commands available. To see a list of commands, run MentaLiST with the -h flag:


In [3]:
# Help: shows all available commands:
mentalist -h


usage: mentalist [-v] [-h]
                 {call|build_db|db_info|list_pubmlst|download_pubmlst|list_cgmlst|download_cgmlst|download_enterobase}

commands:
  call                 MLST caller, given a sample and a k-mer
                       database.
  build_db             Build a MLST k-mer database, given a list of
                       FASTA files.
  db_info              Extract information from an existing MentaLiST
                       k-mer database
  list_pubmlst         List all available MLST schemes from
                       www.pubmlst.org.
  download_pubmlst     Dowload a MLST scheme from pubmlst and build a
                       MLST k-mer database.
  list_cgmlst          List all available cgMLST schemes from
                       www.cgmlst.org.
  download_cgmlst      Dowload a MLST scheme from cgmlst.org and build
                       a MLST k-mer database.
  download_enterobase  Dowload a MLST scheme from Enterobase
                       (enterobase.warwick.ac.uk) and build a MLST
                       k-mer database.

optional arguments:
  -v, --version        show version information and exit
  -h, --help           show this help message and exit

MentaLiST -- The MLST pipeline developed by the PathOGiST research group. https://github.com/WGS-TB/MentaLiST
To cite: Feijao P, Yao H, Fornika D, Gardy J, Hsiao W, Chauve C, Chindelevitch L. 10/01/2018. Microbial Genomics 4(2): doi:10.1099/mgen.0.000146

To see the help for a particular command, run MentaLiST with the command name and the -h flag:


In [4]:
mentalist call -h


usage: mentalist call -o O --db DB [-t MUTATION_THRESHOLD] [--kt KT]
                      [--output_votes] [--output_special]
                      [-i SAMPLE_INPUT_FILE] [-1 [_1...]] [-2 [_2...]]
                      [--fasta] [-h]

optional arguments:
  -o O                  Output file with MLST call
  --db DB               Kmer database
  -t, --mutation_threshold MUTATION_THRESHOLD
                        Maximum number of mutations when looking for
                        novel alleles. (type: Int64, default: 6)
  --kt KT               Minimum # of times a kmer is seen to be
                        considered present in the sample (solid).
                        (type: Int64, default: 10)
  --output_votes        Outputs the results for the original voting
                        algorithm.
  --output_special      Outputs a FASTA file with the alleles from
                        'special cases' such as incomplete coverage,
                        novel, and multiple alleles.
  -i, --sample_input_file SAMPLE_INPUT_FILE
                        Input TXT file for multiple samples. First
                        column has the sample name, second the FASTQ
                        file. Repeat the sample name for samples with
                        more than one file (paired reads, f.i.)
  -1 [_1...]            FastQ input files, one per sample, forward
                        reads (or unpaired reads).
  -2 [_2...]            FastQ input files, one per sample, reverse
                        reads.
  --fasta               Input files are in FASTA format, instead of
                        the default FASTQs.
  -h, --help            show this help message and exit

MentaLiST MLST calling function. Calls alleles on a given MLST database.
You can create a custom DB with 'create_db' or other MentaLiST functions that download schemes from pubmlst, cgmlst.org or Enterobase.

Examples:
mentalist call -o my_sample.mlst --db my_scheme.db -1 sample_1.fastq.gz -2 sample_2.fastq.gz # one paired-end sample.
mentalist call -o all_samples.mlst --db my_scheme.db -1 *.fastq.gz -2 *.fastq.gz # multiple paired-end samples.
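The -i option described in the help above takes a text file with the sample name in the first column and a FASTQ file in the second, repeating the sample name for paired reads. As a rough sketch, such a file could be generated from a folder of FASTQ files. Two things here are assumptions and not taken from the MentaLiST help: the columns are written tab-separated, and the files are assumed to be named SAMPLE_1.fastq.gz and SAMPLE_2.fastq.gz. The helper name `make_sample_sheet` is hypothetical.

```shell
# make_sample_sheet DIR
# Hypothetical helper: prints one "sample<TAB>file" line per FASTQ file in DIR.
# Assumes files are named SAMPLE_1.fastq.gz / SAMPLE_2.fastq.gz and that the
# caller accepts tab-separated columns (both are assumptions).
make_sample_sheet() {
    local dir=$1 f base sample
    for f in "$dir"/*_[12].fastq.gz; do
        [ -e "$f" ] || continue          # skip the unexpanded glob if nothing matches
        base=$(basename "$f")
        sample=${base%_[12].fastq.gz}    # strip the read suffix to get the sample name
        printf '%s\t%s\n' "$sample" "$f"
    done
}
```

A possible use would be `make_sample_sheet . > samples.txt`, followed by `mentalist call -o all_samples.mlst --db my_scheme.db -i samples.txt`.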

Installing MLST schema

MentaLiST needs to create a k-mer database file for a given MLST scheme before it can call alleles. There are different possible options, from custom schema based on local FASTA files, to downloading public schema from pubmlst.org or cgmlst.org.

pubMLST schema

MentaLiST can search and install MLST schema from pubMLST.org, as shown.

List Available pubmlst.org schema

The command 'list_pubmlst' lists the available schema on pubMLST. Since there are many, it is also possible to give a prefix, so that only schema whose species name starts with this prefix are listed.


In [5]:
mentalist list_pubmlst -h


usage: mentalist list_pubmlst [-p PREFIX] [-h]

optional arguments:
  -p, --prefix PREFIX  Only list schemes where the species name starts
                       with this prefix.
  -h, --help           show this help message and exit


In [6]:
# List campylobacter schema:
mentalist list_pubmlst -p Campylobacter


#id	organism
23	Campylobacter concisus/curvus 
24	Campylobacter fetus           
25	Campylobacter helveticus      
26	Campylobacter hyointestinalis 
27	Campylobacter insulaenigrae   
28	Campylobacter jejuni          
29	Campylobacter lanienae        
30	Campylobacter lari            
31	Campylobacter sputorum        
32	Campylobacter upsaliensis     
[ Info: 10 scheme(s) found.
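When scripting, the numeric ID can be pulled out of a listing like the one above. This is a small sketch under the assumption that the listing is tab-separated with the ID in the first column and the organism name (padded with trailing spaces) in the second, as it appears in this output. `scheme_id` is a hypothetical helper, not a MentaLiST command.

```shell
# scheme_id NAME
# Hypothetical helper: reads a "#id<TAB>organism" listing on stdin and prints
# the ID of the scheme whose organism name matches NAME exactly.
scheme_id() {
    awk -F'\t' -v name="$1" '
        /^#/ { next }                     # skip the header line
        { sub(/[ \t]+$/, "", $2) }        # drop the padding after the organism name
        $2 == name { print $1 }
    '
}
```

For example, `mentalist list_pubmlst -p Campylobacter | scheme_id "Campylobacter jejuni"` would be expected to print the ID shown in the listing, assuming the listing is written to standard output.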

Install a pubmlst.org scheme

A scheme can be referenced by species name (exact match) or, more simply, by the ID given in the output of the 'list_pubmlst' command. To install the 'Campylobacter jejuni' scheme, run the following command:


In [7]:
mentalist download_pubmlst -k 31 -o campy_mlst_fasta_files -s 28 --db campy_mlst.db


[ Info: Searching for the scheme ... 
[ Info: Downloading scheme for Campylobacter jejuni ... 
[ Info: Downloading profile ...
Trying to download: https://pubmlst.org/data/profiles/campylobacter.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  322k  100  322k    0     0   229k      0  0:00:01  0:00:01 --:--:--  229k
[ Info: Downloading locus aspA ...
Trying to download: https://pubmlst.org/data/alleles/campylobacter/aspA.tfa
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  235k  100  235k    0     0   170k      0  0:00:01  0:00:01 --:--:--  170k
[ Info: Downloading locus glnA ...
Trying to download: https://pubmlst.org/data/alleles/campylobacter/glnA.tfa
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  315k  100  315k    0     0   227k      0  0:00:01  0:00:01 --:--:--  227k
[ Info: Downloading locus gltA ...
Trying to download: https://pubmlst.org/data/alleles/campylobacter/gltA.tfa
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  229k  100  229k    0     0   166k      0  0:00:01  0:00:01 --:--:--  166k
[ Info: Downloading locus glyA ...
Trying to download: https://pubmlst.org/data/alleles/campylobacter/glyA.tfa
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  375k  100  375k    0     0   269k      0  0:00:01  0:00:01 --:--:--  269k
[ Info: Downloading locus pgm ...
Trying to download: https://pubmlst.org/data/alleles/campylobacter/pgm.tfa
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  486k  100  486k    0     0   291k      0  0:00:01  0:00:01 --:--:--  291k
[ Info: Downloading locus tkt ...
Trying to download: https://pubmlst.org/data/alleles/campylobacter/tkt.tfa
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  351k  100  351k    0     0   252k      0  0:00:01  0:00:01 --:--:--  252k
[ Info: Downloading locus uncA ...
Trying to download: https://pubmlst.org/data/alleles/campylobacter/uncA.tfa
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  279k  100  279k    0     0   202k      0  0:00:01  0:00:01 --:--:--  202k
[ Info: Finished downloading.
[ Info: Building the k-mer database ...
[ Info: Opening FASTA files ... 
[ Info: Combining results for each locus ...
[ Info: Saving DB ...
[ Info: Done!

In [8]:
# The output folder (-o) has all the FASTA files and profile for the scheme.
ls campy_mlst_fasta_files


aspA.tfa           glnA.tfa  glyA.tfa  tkt.tfa
campylobacter.txt  gltA.tfa  pgm.tfa   uncA.tfa

In [9]:
# The --db flag indicates the database file that will be used by MentaLiST in the calling phase.
ls -lh campy_mlst.db


-rw-rw-r--. 1 pfeijao pfeijao 843K Feb 14 10:33 campy_mlst.db

cgMLST schema

As with the pubMLST schema, MentaLiST can also download and install cgMLST schema from cgmlst.org.

List available cgMLST schema from cgmlst.org


In [10]:
mentalist list_cgmlst


#id	organism
3956907	Acinetobacter baumannii       
6398355	Brucella melitensis           
3560802	Clostridioides difficile      
991893	Enterococcus faecium          
260204	Francisella tularensis        
2187931	Klebsiella pneumoniae/variicola/quasipneumoniae
1025099	Legionella pneumophila        
690488	Listeria monocytogenes        
741110	Mycobacterium tuberculosis/bovis/africanum/canettii
6402012	Mycoplasma gallisepticum      
141106	Staphylococcus aureus         
[ Info: 11 schema found.

Download and install a cgMLST scheme from cgmlst.org


In [11]:
mentalist download_cgmlst -h


usage: mentalist download_cgmlst --db DB -k K [--threads THREADS]
                        [-c ALLELE_COVERAGE] -o OUTPUT -s SCHEME [-h]

optional arguments:
  --db DB               Output file (kmer database)
  -k K                  Kmer size (type: Int8)
  --threads THREADS     Number of threads used in parallel. (type:
                        Int64, default: 2)
  -c, --allele_coverage ALLELE_COVERAGE
                        Minimum percentage of allele coverage in
                        number of kmers of each allele. (type:
                        Float64, default: 1.0)
  -o, --output OUTPUT   Output folder for the scheme Fasta files.
  -s, --scheme SCHEME   Species name or scheme ID.
  -h, --help            show this help message and exit


In [12]:
mentalist download_cgmlst -o mtb_cgmlst_fasta -s 741110 -k 31 --db mtb_cgmlst.db --threads 16


[ Info: Downloading cgMLST scheme ...
Trying to download: https://www.cgmlst.org/ncs/schema/741110/alleles
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.2M    0 11.2M    0     0   759k      0 --:--:--  0:00:15 --:--:-- 1307k
[ Info: Unzipping cgMLST scheme into individual FASTA files for each locus ...
...............
[ Info: 2891 loci found.
[ Info: Building the k-mer database ...
[ Info: Opening FASTA files ... 
[ Info: Combining results for each locus ...
[ Info: Saving DB ...
[ Info: Done!

Install a custom scheme from FASTA files

It is also possible to install a custom MLST scheme from FASTA files. Each file should be named LOCUS.fa (the extension is not important; it can be .fasta, .tfa, etc.), and each allele in the file should have the identifier LOCUS_N (or, alternatively, LOCUS.N), where N is a unique number for each allele, usually running from 1 to the number of alleles.
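The naming convention described here can be checked before building a database. The following is a minimal sketch and not part of MentaLiST: `check_locus_fasta` is a hypothetical helper that takes the locus name from the file name (without its extension) and reports headers that are not LOCUS_N or LOCUS.N.

```shell
# check_locus_fasta FILE
# Hypothetical helper: prints every FASTA header in FILE that is not LOCUS_N
# or LOCUS.N, where LOCUS is the file name without its extension.
# Returns non-zero if any such header is found.
check_locus_fasta() {
    local file=$1 locus
    locus=$(basename "$file")
    locus=${locus%.*}
    awk -v locus="$locus" '
        /^>/ {
            id     = substr($1, 2)                     # header ID without ">"
            prefix = substr(id, 1, length(locus))
            sep    = substr(id, length(locus) + 1, 1)
            num    = substr(id, length(locus) + 2)
            if (prefix != locus || (sep != "_" && sep != ".") || num !~ /^[0-9]+$/) {
                print FILENAME ": unexpected header " $0
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$file"
}
```

It could be run over a whole scheme folder with a loop such as `for f in my_scheme/*.fa; do check_locus_fasta "$f"; done`, where `my_scheme` is a placeholder folder name.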

For instance, let's test this functionality with the Campylobacter scheme FASTA files that were downloaded in the pubMLST example above:


In [13]:
# Each file is a different locus:
ls campy_mlst_fasta_files/*.tfa


campy_mlst_fasta_files/aspA.tfa  campy_mlst_fasta_files/pgm.tfa
campy_mlst_fasta_files/glnA.tfa  campy_mlst_fasta_files/tkt.tfa
campy_mlst_fasta_files/gltA.tfa  campy_mlst_fasta_files/uncA.tfa
campy_mlst_fasta_files/glyA.tfa

In [14]:
# For each locus file, a different ID and sequence for each allele:
head -n 18 campy_mlst_fasta_files/glnA.tfa


>glnA_1
GATCCTTTTACGGCTGATCCTACTATCATAGTATTTTGTGATGTGTATGATATTTACAAA
GGACAAATGTATGAAAAATGTCCAAGAAGCATAGCAAAAAAAGCAATAGAACACCTTAAA
AATAGTGGCATAGCTGATACTGCTTACTTTGGACCAGAAAATGAATTCTTTGTTTTTGAT
AGTGTAAAAATAGTTGATACTACTCATTGTTCTAAGTATGAAGTTGATACCGAAGAAGGA
GAGTGGAATGATGATAGAGAATTTACCGATAGCTACAATACTGGACACAGGCCAAGAAAC
AAAGGTGGATATTTTCCAGTTCAGCCAATTGATTCTTTAGTAGATATTCGTTCTGAAATG
GTTCAAACCCTTGAAAAAGTAGGTCTTAAAACTTTTGTTCATCATCATGAAGTTGCACAA
GGACAAGCTGAAATAGGAGTAAATTTTGGCACGCTTGTAGAAGCAGCTGACAATGTT
>glnA_2
GATCCTTTTACGGCTGATCCTACTATCATAGTATTTTGTGATGTGTATGATATTTACAAA
GGACAAATGTATGAAAAATGTCCAAGAAGCATAGCAAAAAAAGCAATGGAACACCTTAAA
AATAGTGGCATAGCTGATACTGCTTACTTTGGACCAGAAAATGAATTCTTTGTTTTTGAT
AGTGTAAAAATAGTTGATACTACTCATTGTTCTAAGTATGAAGTTGATACCGAAGAAGGA
GAGTGGAATGATGATAGAGAATTTACCGATAGCTACAATACTGGACACAGGCCAAGAAAC
AAAGGTGGATATTTTCCAGTTCAGCCAATTGATTCTTTAGTAGATATTCGTTCTGAAATG
GTTCAAACCCTTGAAAAAGTAGGTCTTAAAACTTTTGTTCATCATCATGAAGTTGCACAA
GGACAAGCTGAAATAGGAGTAAATTTTGGCACGCTTGTAGAAGCAGCTGACAATGTT

In [15]:
# Install the Campylobacter jejuni scheme directly from the FASTA files; let's use a different k-mer length:
mentalist build_db -k 25 --db campy_mlst_25.db -p campy_mlst_fasta_files/campylobacter.txt -f campy_mlst_fasta_files/*.tfa


[ Info: Opening FASTA files ... 
[ Info: Combining results for each locus ...
[ Info: Saving DB ...
[ Info: Done!

Calling MLST alleles for a sample

After a k-mer database has been created, MentaLiST can call alleles for a given sample.


In [16]:
# Help:
mentalist call -h


usage: mentalist call -o O --db DB [-t MUTATION_THRESHOLD] [--kt KT]
                      [--output_votes] [--output_special]
                      [-i SAMPLE_INPUT_FILE] [-1 [_1...]] [-2 [_2...]]
                      [--fasta] [-h]

optional arguments:
  -o O                  Output file with MLST call
  --db DB               Kmer database
  -t, --mutation_threshold MUTATION_THRESHOLD
                        Maximum number of mutations when looking for
                        novel alleles. (type: Int64, default: 6)
  --kt KT               Minimum # of times a kmer is seen to be
                        considered present in the sample (solid).
                        (type: Int64, default: 10)
  --output_votes        Outputs the results for the original voting
                        algorithm.
  --output_special      Outputs a FASTA file with the alleles from
                        'special cases' such as incomplete coverage,
                        novel, and multiple alleles.
  -i, --sample_input_file SAMPLE_INPUT_FILE
                        Input TXT file for multiple samples. First
                        column has the sample name, second the FASTQ
                        file. Repeat the sample name for samples with
                        more than one file (paired reads, f.i.)
  -1 [_1...]            FastQ input files, one per sample, forward
                        reads (or unpaired reads).
  -2 [_2...]            FastQ input files, one per sample, reverse
                        reads.
  --fasta               Input files are in FASTA format, instead of
                        the default FASTQs.
  -h, --help            show this help message and exit

MentaLiST MLST calling function. Calls alleles on a given MLST database.
You can create a custom DB with 'create_db' or other MentaLiST functions that download schemes from pubmlst, cgmlst.org or Enterobase.

Examples:
mentalist call -o my_sample.mlst --db my_scheme.db -1 sample_1.fastq.gz -2 sample_2.fastq.gz # one paired-end sample.
mentalist call -o all_samples.mlst --db my_scheme.db -1 *.fastq.gz -2 *.fastq.gz # multiple paired-end samples.

For this example we are using a Campylobacter jejuni sample from EMBL ENA. You can download the FASTQ file with the following command:


In [17]:
# the --no-clobber option checks if the file already exists, so it does not download it again.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR582/007/SRR5824107/SRR5824107_1.fastq.gz --no-clobber


File ‘SRR5824107_1.fastq.gz’ already there; not retrieving.

Now, run the MentaLiST caller on this sample, using the --db flag to pass the MentaLiST database that we created previously.


In [18]:
mentalist call -o campy_call.txt --db campy_mlst.db -1 SRR5824107_1.fastq.gz


[ Info: Opening kmer database ... 
[ Info: Finished the JLD load, building alleles list...
[ Info: Decompressing weight list...
[ Info: Building kmer index ...
[ Info: Sample: SRR5824107_1. Opening fastq file(s) and counting kmers ... 
[ Info: Voting for alleles ... 
[ Info: Calling alleles and novel alleles ...
[ Info: Writing output ...
[ Info: Done.

The main output consists of two files: one has the calls, and the other has details about the coverage of each call and special cases. The listing also includes .novel.fa and .novel.txt files for novel alleles.


In [19]:
# results:
ls campy_call.*


campy_call.txt               campy_call.txt.novel.fa
campy_call.txt.coverage.txt  campy_call.txt.novel.txt

In [20]:
# Allele calls and ST are in the campy_call.txt file:
column -ts $'\t' campy_call.txt


Sample        aspA  glnA  gltA  glyA  pgm  tkt  uncA  ST   clonal_complex
SRR5824107_1  2     17    2     3     2    1    5     883  ST-21 complex

In [21]:
# Coverage details for each call:
cat campy_call.txt.coverage.txt


Sample	Locus	Coverage	MinKmerDepth	Call
SRR5824107_1	aspA	1.0	61	Called allele 2.
SRR5824107_1	glnA	1.0	34	Called allele 17.
SRR5824107_1	gltA	1.0	62	Called allele 2.
SRR5824107_1	glyA	1.0	46	Called allele 3.
SRR5824107_1	pgm	1.0	15	Called allele 2.
SRR5824107_1	tkt	1.0	34	Called allele 1.
SRR5824107_1	uncA	1.0	54	Called allele 5.

M. tuberculosis sample on the cgMLST scheme.

Now we test the MentaLiST call on an M. tuberculosis sample, also downloaded from EMBL ENA. This time we are going to download both paired-end FASTQ files:


In [22]:
# the --no-clobber option checks if the file already exists, so it does not download it again.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR615/008/SRR6152708/SRR6152708_{1,2}.fastq.gz --no-clobber


File ‘SRR6152708_1.fastq.gz’ already there; not retrieving.
File ‘SRR6152708_2.fastq.gz’ already there; not retrieving.

For this example, we will use the flags --output_votes and --output_special, which tell MentaLiST to create additional output files. To use paired-end samples, pass the forward reads with -1 and the reverse reads with -2:


In [23]:
## Call alleles for the sample:
mentalist call -o SRR6152708.txt --output_votes --output_special  --db mtb_cgmlst.db -1 SRR6152708_1.fastq.gz -2 SRR6152708_2.fastq.gz


[ Info: Opening kmer database ... 
[ Info: Finished the JLD load, building alleles list...
[ Info: Decompressing weight list...
[ Info: Building kmer index ...
[ Info: Sample: SRR6152708. Opening fastq file(s) and counting kmers ... 
[ Info: Voting for alleles ... 
[ Info: Calling alleles and novel alleles ...
[ Info: Writing output ...
[ Info: Done.

In addition to the same kinds of output files as in the previous example (the calls in SRR6152708.txt and the coverage details in SRR6152708.txt.coverage.txt), there are some new files, created because of the --output_votes and --output_special flags.


In [24]:
ls -l SRR6152708.txt*


-rw-rw-r--. 1 pfeijao pfeijao   27602 Feb 14 10:43 SRR6152708.txt
-rw-rw-r--. 1 pfeijao pfeijao   27643 Feb 14 10:43 SRR6152708.txt.byvote
-rw-rw-r--. 1 pfeijao pfeijao  128763 Feb 14 10:43 SRR6152708.txt.coverage.txt
-rw-rw-r--. 1 pfeijao pfeijao   89361 Feb 14 10:43 SRR6152708.txt.novel.fa
-rw-rw-r--. 1 pfeijao pfeijao    5838 Feb 14 10:43 SRR6152708.txt.novel.txt
-rw-rw-r--. 1 pfeijao pfeijao  205768 Feb 14 10:43 SRR6152708.txt.special_cases.fa
-rw-rw-r--. 1 pfeijao pfeijao     719 Feb 14 10:43 SRR6152708.txt.ties.txt
-rw-rw-r--. 1 pfeijao pfeijao 1209964 Feb 14 10:43 SRR6152708.txt.votes.txt

Let's do a quick check of the first 11 calls (the first 12 columns, counting the sample name):


In [25]:
cut -f1-12 SRR6152708.txt | column -ts $'\t'


Sample      Rv0014c  Rv0015c  Rv0016c  Rv0017c  Rv0018c  Rv0019c  Rv0021c  Rv0022c  Rv0023  Rv0024  Rv0025
SRR6152708  5        2        1        1        2        1        1        1        1       N       1

The coverage file has more details of each call. Looking at the first lines, we can see that MentaLiST found some novel alleles:


In [26]:
head -n 15 SRR6152708.txt.coverage.txt


Sample	Locus	Coverage	MinKmerDepth	Call
SRR6152708	Rv0014c	1.0	43	Called allele 5.
SRR6152708	Rv0015c	1.0	18	Called allele 2.
SRR6152708	Rv0016c	1.0	45	Called allele 1.
SRR6152708	Rv0017c	1.0	45	Called allele 1.
SRR6152708	Rv0018c	1.0	33	Called allele 2.
SRR6152708	Rv0019c	1.0	39	Called allele 1.
SRR6152708	Rv0021c	1.0	35	Called allele 1.
SRR6152708	Rv0022c	1.0	32	Called allele 1.
SRR6152708	Rv0023	1.0	57	Called allele 1.
SRR6152708	Rv0024	1.0	42	Novel, 1 mutation from allele 98: Del of len 1 at pos 719
SRR6152708	Rv0025	1.0	55	Called allele 1.
SRR6152708	Rv0033	1.0	48	Called allele 1.
SRR6152708	Rv0034	1.0	45	Called allele 2.
SRR6152708	Rv0035	1.0	46	Novel, 2 mutations from allele 227: Subst C->G at pos 47, Subst A->T at pos 76
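With a cgMLST scheme of this size, a count of each kind of call can be easier to read than the raw file. This is a small sketch that assumes the tab-separated layout shown above, with the call text in the fifth column. `summarize_calls` is a hypothetical helper, and any call text that does not start with "Called allele" or "Novel" is simply counted as "other".

```shell
# summarize_calls COVERAGE_FILE
# Hypothetical helper: counts the calls in a *.coverage.txt file by type,
# based on how the text in the fifth (Call) column starts.
summarize_calls() {
    awk -F'\t' '
        NR == 1               { next }             # skip the header line
        $5 ~ /^Called allele/ { called++; next }
        $5 ~ /^Novel/         { novel++;  next }
                              { other++ }          # anything else, e.g. "Not present"
        END { printf "called\t%d\nnovel\t%d\nother\t%d\n", called, novel, other }
    ' "$1"
}
```

For the sample above it would be run as `summarize_calls SRR6152708.txt.coverage.txt`.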

Let's look a bit more into the novel alleles found.

Novel allele detection

The .novel.txt file has all novel alleles with descriptions. For each locus, we can see which allele MentaLiST used as a template, and which mutations were applied to discover the new allele.


In [40]:
head SRR6152708.txt.novel.txt | column -ts $'\t'


Sample      Locus    Novel_id  MinKmerDepth  Nmut  Desc
SRR6152708  Rv0024   N1        42            1     From allele 98, Del of len 1 at pos 719.
SRR6152708  Rv0035   N1        46            2     From allele 227, Subst C->G at pos 47, Subst A->T at pos 76.
SRR6152708  Rv0045c  N1        61            4     From allele 62, Subst C->T at pos 318, Del of len 2 at pos 650, Subst A->G at pos 652.
SRR6152708  Rv0063   N1        55            1     From allele 140, Ins of base G at pos 334.
SRR6152708  Rv0101   N1        50            2     From allele 1541, Subst A->G at pos 5360, Subst A->G at pos 6088.
SRR6152708  Rv0134   N1        66            2     From allele 25, Subst G->T at pos 374, Del of len 1 at pos 386.
SRR6152708  Rv0165c  N1        58            3     From allele 16, Subst T->C at pos 147, Ins of base G at pos 163, Ins of base G at pos 164.
SRR6152708  Rv0195   N1        63            2     From allele 88, Subst G->A at pos 185, Subst T->C at pos 191.
SRR6152708  Rv0226c  N1        66            2     From allele 59, Subst A->C at pos 36, Subst A->G at pos 1229.

The novel allele DNA sequences are in the .novel.fa FASTA file.


In [28]:
head -n20 SRR6152708.txt.novel.fa


>Rv0024_N1 Seen in 1 sample(s).
GTGAATACAGCGAGGTCGAGCTGTTGAGTCGCGCTCATCAACTGTTCGCCGGAGACAGTCGGCGACCGGGGTTGGATGCGGGCACCACACCCTACGGGGATCTGCTGTCTCGGGCTGCCG
ACCTGAATGTGGGTGCGGGCCAGCGCCGGTATCAACTCGCCGTGGACCACAGCCGGGCGGCCTTGCTGTCTGCTGCGCGAACCGATGCCGCGGCCGGGGCCGTCATCACCGGCGCTCAAC
GGGATCGGGCATGGGCCCGGCGGTCGACCGGAACCGTTCTCGACGAGGCTCGCTCGGATACCACCGTTACTGCGGTTATGCCGATAGCCCAGCGCGAAGCCATACGCCGTCGTGTGGCGC
GGCTGCGCGCGCAACGAGCCCATGTGCTGACGGCGCGACGACGGGCACGACGGCACCTGGCGGCGCTGCGTGCGCTGCGGTACCGGGTGGCGCACGGCCCGGGGGTCGCGCTGGCCAAAC
TTCGGCTGCCGTCGCCGAGCGGTCGCGCCGGCATCGCGGTCCACGCCGCGCTGTCGCGACTTGGCCGTCCCTATGTCTGGGGCGCAACGGGGCCCAACCAGTTCGACTGTTCCGGTTTGG
TCCAGTGGGCCTACGCCCAGGCGGGTGTTCACCTGGATCGCACCACCTATCAACAGATCAACGAGGGGATCCCGGTGCCGCGCTCACAGGTCCGGCCGGGCGATCTGGTCTTCCCGCACC
CCGGGCACGTGCAGCTGGCGATCGGCAACAATCTGGTCGTCGAGGCGCCCCATGCGGGCGCGTCGGTTCGGGTCAGCTCGCTGGGCAACAACGTGCAGATTCGGCGACCGCTGAGTGGCA
GATAA
>Rv0035_N1 Seen in 1 sample(s).
ATGACGGCGGCCTTGCTTTCACCAGCCATCGCCTGGCAGCAGATCTGGGCTTGCACGGACCGCACGCTGACGATCTCTTGCGAGGATTCCGAGGTAATCAGCTATCAGGACCTCATCGCG
CGCGCGGCGGCATGCATCCCCCCGCTACGGCGTCTTGACCTCAAACGCGGTGAACCCGTGCTGATCACCGCCCACACCAACCTGGAATTCCTGTCCTGCTTTTTGGGCCTCATGCTCCAT
GGCGCTGTGCCGGTACCCATCCCGCCGCGGGAGGCACTGAAGACCACCGAGCGTTTCATGACTCGGCTCGGCCCACTGCTGCGCCATCACCGCGTGCTGATCTGCACACCGGCCGAACAC
GACGAGATACGCGCTGCCGCCAGCACCGACTGCCAGATCAGCAGATTTACTGCCCTAGCCGAGGCTGGCGACGAGCAGTTCGGCCGCGCCACGGCCCAGCAACTCGCCGACACCGCCACC
GCCGACTGGCCGCTATGCACCCTCGACGACGACGCCTACGTCCAATACACCTCTGGCAGCACCGCAGCACCACGCGGAGTGGTCATCACCTACCGCAACCTGCTGTCCAACATGCGCGCA
ATGGCCGTGGGCTCACAATTCCAGCACGGCGATGTCATGGGCAGCTGGCTGCCCTTGCACCATGACATGGGGCTGGTGGGCAGCCTATTCGCCGCACTCTTCAACAGTGTCAGCGCGGTA
TTCACCACGCCACACCGGTTTCTGTATGACCCGTTGGGATTCCTCAGACTGCTCACCAGCTCCGGGGCTACCCACACGTTCATGCCTAACTTCGCTCTGGAGTGGCTGATCAACGCCTAC
CACAGGCGCGGCGCCGACATCGAAGGCATCGACCTACACAAAATGCGCCGCTTGATCATCGCCTCCGAACCCGTCCATGCCGAGGGCATGCGGAGATTCGCCGCCACCTTCGCCGGCGTC
GGACTTGCCCCCACGGCCCTGGGTTCGGGCTATGGCCTGGCCGAAGCGACCGTCGCCGTGTCAATGTCAGCGCCCAACACGGGATTCCGCACCGAAACCCACGCCGCCGCGGAGGTCGTC
ACCGGCGGCCGAGTGCTGCCTGGCTACGAGGTGCGCATTGACGCCGCACCAGGTGCCCGGGCCGGAACGATCAAACTGCGCGGCGACAGCGTGGCCGCCAAAGCCTATGTGGGCGGGAAG

Optional: outputting the voting calls

The --output_votes flag makes MentaLiST output three additional files: SRR6152708.txt.byvote, SRR6152708.txt.votes.txt and SRR6152708.txt.ties.txt. These are the results of the old calling algorithm from MentaLiST 0.1, where only the top-voted allele is called. The current version, MentaLiST 1.0, also checks the allele sequences to ensure that the called allele has full coverage, and tries to find novel alleles.


In [29]:
# Calls by the old voting algorithm:
cut -f1-12 SRR6152708.txt.byvote | column -ts $'\t'


Sample      Rv0014c  Rv0015c  Rv0016c  Rv0017c  Rv0018c  Rv0019c  Rv0021c  Rv0022c  Rv0023  Rv0024  Rv0025
SRR6152708  5        2        1        1        2        1        1        1        1       165     1
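To see where the two algorithms disagree across all loci, the two call files can be compared column by column. This is a rough sketch under the assumption that both files hold a header line plus a single sample line, with the loci in the same column order, as in the outputs shown here. `diff_calls` is a hypothetical helper name.

```shell
# diff_calls CALLS_FILE BYVOTE_FILE
# Hypothetical helper: prints "locus<TAB>call in first file<TAB>call in second
# file" for every locus whose call differs. Assumes one sample per file and
# identical column order in both files.
diff_calls() {
    awk -F'\t' '
        FNR == 1              { file++ }
        FNR == 1 && file == 1 { for (i = 1; i <= NF; i++) name[i]  = $i; next }
        FNR == 2 && file == 1 { for (i = 1; i <= NF; i++) first[i] = $i; next }
        FNR == 2 && file == 2 {
            for (i = 2; i <= NF; i++)                  # column 1 is the sample name
                if (($i "") != (first[i] ""))          # compare as strings
                    printf "%s\t%s\t%s\n", name[i], first[i], $i
        }
    ' "$1" "$2"
}
```

For this sample, `diff_calls SRR6152708.txt SRR6152708.txt.byvote` would list loci such as Rv0024, where the current algorithm reports a novel allele (N) and the voting algorithm reports allele 165.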

The SRR6152708.txt.votes.txt file has the top-voted alleles for each locus:


In [30]:
head -n12 SRR6152708.txt.votes.txt


Sample	Locus	Total locus votes	Allele(relative votes),...
SRR6152708	Rv0014c	111048	5(1822),491(662),485(338),354(278),310(259),552(247),336(242),33(235),137(221),16(126),164(119),246(119),427(111),60(69),191(67),165(39),542(0),1(0),367(-14),162(-32),458(-48),266(-56),531(-65),189(-80),232(-95),284(-137),234(-159),203(-164),350(-167),245(-177),135(-180),178(-189),35(-206),355(-210),545(-217),307(-232),493(-235),559(-236),369(-236),329(-239)
SRR6152708	Rv0015c	66555	2(1710),219(1524),306(1088),282(1085),153(1081),201(1076),25(1069),304(1065),204(1057),133(1042),131(1035),303(1031),155(1021),286(989),280(945),327(882),416(607),384(586),188(431),119(431),288(407),360(225),321(207),197(166),161(42),322(35),187(27),296(22),1(0),189(-7),168(-47),46(-87),33(-109),230(-154),302(-177),136(-226),403(-268),85(-283),120(-297),247(-322)
SRR6152708	Rv0016c	97432	1(0),278(-721),249(-731),65(-802),391(-1093),286(-1102),141(-1222),190(-1479),108(-1489),10(-1490),137(-1503),126(-1513),251(-1521),67(-1525),362(-1537),236(-1546),28(-1557),33(-1578),219(-1590),110(-1633),208(-1680),9(-1707),176(-1712),328(-1717),351(-1737),312(-1747),151(-1747),40(-1747),322(-1747),274(-1748),329(-1749),332(-1749),133(-1750),105(-1753),356(-1753),7(-1760),79(-1765),106(-1768),96(-1772),179(-1772)
SRR6152708	Rv0017c	84918	1(0),281(-60),32(-365),395(-551),367(-673),132(-686),267(-686),173(-822),71(-1025),38(-1072),216(-1127),285(-1208),243(-1238),291(-1295),289(-1371),172(-1515),413(-1548),55(-1577),114(-1578),68(-1579),115(-1580),219(-1580),350(-1583),311(-1585),296(-1585),373(-1585),230(-1587),104(-1590),127(-1591),277(-1593),335(-1594),158(-1596),195(-1597),376(-1598),331(-1599),369(-1600),64(-1600),61(-1600),156(-1600),205(-1602)
SRR6152708	Rv0018c	91523	2(0),228(-46),176(-312),337(-334),358(-418),257(-559),366(-733),312(-733),128(-1077),334(-1162),191(-1208),67(-1225),295(-1241),341(-1307),233(-1330),410(-1331),66(-1333),316(-1345),363(-1346),370(-1347),293(-1347),307(-1352),218(-1356),439(-1358),451(-1358),247(-1365),132(-1366),399(-1366),53(-1366),392(-1369),189(-1370),155(-1373),215(-1376),71(-1378),385(-1378),441(-1381),224(-1384),309(-1389),204(-1391),367(-1391)
SRR6152708	Rv0019c	26728	1(0),13(-292),103(-545),113(-965),62(-965),90(-1055),4(-1055),32(-1148),117(-1330),84(-1360),101(-1360),35(-1361),81(-1361),75(-1373),12(-1393),53(-1442),43(-1443),69(-1534),25(-1548),23(-1548),63(-1575),56(-1632),31(-1656),123(-1656),27(-1656),18(-1661),19(-1668),15(-1668),16(-1675),10(-1677),47(-1682),91(-1686),76(-1686),98(-1686),49(-1691),45(-1696),126(-1708),111(-1714),48(-1714),107(-1725)
SRR6152708	Rv0021c	56835	1(0),90(-63),183(-63),70(-188),359(-188),277(-248),75(-248),115(-376),242(-442),164(-504),97(-632),261(-998),253(-1147),209(-1166),69(-1197),332(-1243),67(-1251),38(-1263),345(-1270),290(-1270),210(-1306),174(-1313),245(-1315),250(-1323),118(-1323),68(-1335),49(-1336),73(-1340),133(-1359),216(-1367),208(-1382),251(-1390),17(-1397),86(-1400),322(-1416),336(-1443),313(-1448),154(-1481),213(-1518),149(-1546)
SRR6152708	Rv0022c	17489	1(0),95(-98),56(-126),12(-166),73(-205),3(-239),52(-339),72(-388),94(-436),163(-437),64(-476),22(-485),107(-575),101(-601),68(-619),8(-645),54(-664),9(-708),65(-753),46(-753),30(-882),85(-909),134(-954),128(-954),142(-954),88(-968),59(-1088),121(-1089),58(-1105),45(-1105),158(-1114),125(-1120),99(-1120),61(-1121),34(-1122),131(-1126),66(-1126),152(-1126),89(-1128),141(-1130)
SRR6152708	Rv0023	53210	1(0),205(-145),13(-145),163(-219),199(-292),42(-327),89(-435),39(-642),34(-712),85(-782),70(-818),218(-851),47(-960),137(-1069),103(-1507),155(-1549),9(-1579),120(-1651),167(-1720),44(-1824),111(-1825),132(-1829),268(-1831),250(-1832),256(-1834),74(-1840),24(-1853),128(-1864),226(-1888),214(-1892),116(-1892),114(-1894),117(-1894),78(-1904),179(-1914),259(-1919),119(-1927),158(-1928),6(-1936),82(-1938)
SRR6152708	Rv0024	53208	165(0),217(0),1(0),222(0),162(-120),204(-300),266(-418),100(-550),110(-658),19(-710),248(-774),271(-791),252(-945),264(-1012),132(-1022),98(-1055),184(-1250),187(-1250),226(-1404),107(-1440),152(-1441),237(-1445),94(-1452),173(-1472),9(-1481),238(-1481),67(-1515),86(-1517),134(-1543),198(-1551),65(-1553),77(-1558),139(-1566),117(-1569),305(-1591),192(-1594),106(-1605),172(-1610),112(-1617),119(-1629)
SRR6152708	Rv0025	21944	1(0),54(-140),14(-294),60(-547),82(-870),120(-870),117(-989),66(-1071),62(-1115),90(-1279),5(-1351),61(-1415),65(-1643),115(-1643),8(-1659),32(-1717),58(-1787),106(-1790),24(-1790),25(-1792),104(-1792),92(-1792),121(-1817),12(-1820),38(-1834),59(-1835),40(-1836),47(-1840),39(-1850),94(-1851),2(-1863),64(-1863),30(-1868),105(-1876),80(-1877),88(-1877),95(-1880),107(-1898),20(-1899),85(-1902)

As we can see, there is a tie on locus Rv0024. This can happen, especially on loci with novel alleles. The SRR6152708.txt.ties.txt file lists the loci where there is a tie for the most voted allele, along with all tied alleles:


In [31]:
cat SRR6152708.txt.ties.txt


Sample	Locus	Tied Alleles
SRR6152708	Rv0024	1, 165, 217, 222
SRR6152708	Rv0101	8, 815, 1602
SRR6152708	Rv0538	2, 40, 68, 159, 208, 231, 258, 270
SRR6152708	Rv0757	1, 117
SRR6152708	Rv0818	1, 29
SRR6152708	Rv0826	1, 38
SRR6152708	Rv1097c	1, 185
SRR6152708	Rv1363c	1, 104
SRR6152708	Rv1417	1, 5, 6, 9, 10, 13, 16, 17, 21, 26, 32, 44, 45, 48, 51, 52, 53, 58, 60, 61, 70, 72, 78, 79, 80, 81, 88, 90, 91, 95, 98, 100, 102, 104
SRR6152708	Rv2148c	1, 84, 115
SRR6152708	Rv2176	1, 135
SRR6152708	Rv2330c	1, 3
SRR6152708	Rv2526	1, 31
SRR6152708	Rv2949c	1, 48, 162
SRR6152708	Rv2975c	1, 35
SRR6152708	Rv3091	1, 277, 345
SRR6152708	Rv3234c	1, 16
SRR6152708	Rv3253c	1, 107
SRR6152708	Rv3736	4, 279
SRR6152708	Rv3830c	1, 5, 50, 116

If we check those loci in the MentaLiST coverage report, we can see several different cases: novel alleles, regular calls, and a locus reported as not present:


In [32]:
for p in $(cut -f2 SRR6152708.txt.ties.txt); do grep $p SRR6152708.txt.coverage.txt; done | column -ts $'\t'


Sample      Locus    Coverage  MinKmerDepth  Call
SRR6152708  Rv0024   1.0       42            Novel, 1 mutation from allele 98: Del of len 1 at pos 719
SRR6152708  Rv0101   1.0       50            Novel, 2 mutations from allele 1541: Subst A->G at pos 5360, Subst A->G at pos 6088
SRR6152708  Rv0538   1.0       20            Called allele 2.
SRR6152708  Rv0757   1.0       49            Novel, 2 mutations from allele 141: Subst T->C at pos 364, Subst T->C at pos 373
SRR6152708  Rv0818   1.0       56            Novel, 3 mutations from allele 89: Subst G->A at pos 275, Subst C->T at pos 290, Subst G->A at pos 293
SRR6152708  Rv0826   1.0       49            Novel, 2 mutations from allele 84: Subst C->G at pos 901, Subst G->A at pos 920
SRR6152708  Rv1097c  1.0       62            Novel, 3 mutations from allele 135: Subst A->G at pos 302, Del of len 2 at pos 311
SRR6152708  Rv1363c  1.0       43            Novel, 2 mutations from allele 61: Subst C->A at pos 59, Subst C->G at pos 75
SRR6152708  Rv1417   0.0       0             Not present; allele 58 is the best covered but below threshold with 188/435 missing kmers.
SRR6152708  Rv2148c  1.0       75            Novel, 2 mutations from allele 87: Subst G->C at pos 19, Del of len 1 at pos 4
SRR6152708  Rv2176   1.0       53            Novel, 2 mutations from allele 53: Subst C->T at pos 75, Subst T->C at pos 95
SRR6152708  Rv2330c  1.0       43            Called allele 1.
SRR6152708  Rv2526   1.0       53            Novel, 6 mutations from allele 4: Subst A->G at pos 45, Del of len 2 at pos 72, Del of len 3 at pos 73
SRR6152708  Rv2949c  1.0       32            Called allele 1.
SRR6152708  Rv2975c  1.0       51            Novel, 3 mutations from allele 42: Subst G->A at pos 32, Del of len 2 at pos 6
SRR6152708  Rv3091   1.0       59            Novel, 2 mutations from allele 166: Subst T->C at pos 1289, Subst G->A at pos 1299
SRR6152708  Rv3234c  1.0       60            Novel, 2 mutations from allele 19: Ins of base C at pos 18, Subst G->T at pos 57
SRR6152708  Rv3253c  1.0       36            Novel, 2 mutations from allele 278: Subst G->A at pos 447, Subst T->C at pos 462
SRR6152708  Rv3736   1.0       75            Novel, 2 mutations from allele 83: Subst T->C at pos 510, Subst A->G at pos 755
SRR6152708  Rv3830c  1.0       61            Novel, 1 mutation from allele 105: Ins of base G at pos 147
  • Unique full coverage: For loci Rv0538, Rv2330c and Rv2949c, MentaLiST found that only one of the top-voted alleles had full coverage, and called it.
  • Missing locus: For locus Rv1417, MentaLiST called it as not present, since it only has 188/435 < 50% coverage. This might be due to a poorly covered region in the sample, or because the gene is really absent from the sample while other regions of the genome are similar enough to it to cause the partial $k$-mer match.

  • Novel allele: For all the other loci where there was a tie, MentaLiST found a putative novel allele.
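
The three cases above can also be tallied automatically. The following is a minimal sketch, not part of MentaLiST: summarize_tie_calls is a hypothetical helper that assumes both files are tab-separated, with the locus in the second column and the call text in the fifth column of the coverage report, as in the listings above. Unlike the grep loop, it matches locus names exactly rather than as substrings.

```shell
# Hypothetical helper (not part of MentaLiST): tally the coverage-report
# outcome for every locus listed in the ties file.
# Usage: summarize_tie_calls TIES_FILE COVERAGE_FILE
summarize_tie_calls() {
    awk -F'\t' '
        # First file (ties): remember each tied locus, skipping the header.
        FNR == NR { if (FNR > 1) tied[$2] = 1; next }
        # Second file (coverage): classify the Call column of tied loci.
        FNR > 1 && ($2 in tied) {
            if      ($5 ~ /^Called/)      kind = "called"
            else if ($5 ~ /^Novel/)       kind = "novel"
            else if ($5 ~ /^Not present/) kind = "not_present"
            else                          kind = "other"
            count[kind]++
        }
        END { for (k in count) printf "%s\t%d\n", k, count[k] }
    ' "$1" "$2" | sort
}
```

It could be run as summarize_tie_calls SRR6152708.txt.ties.txt SRR6152708.txt.coverage.txt, printing one line per outcome with the number of tied loci that ended up in it.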

Special Cases

There are some special cases in which MentaLiST flags a call in the call file by appending an extra character to the allele number.

Multiple possible alleles

This happens when more than one allele has full coverage in the sample:


In [33]:
grep Multiple SRR6152708.txt.coverage.txt


SRR6152708	Rv0471c	1.0	43	Multiple possible alleles:1, 118 with depth 43, 39 and votes 0, -724. Most voted (1) is chosen on call file.
SRR6152708	Rv1318c	1.0	36	Multiple possible alleles:302, 1 with depth 36, 36 and votes 4427, 0. Most voted (302) is chosen on call file.
SRR6152708	Rv1319c	1.0	35	Multiple possible alleles:8, 3 with depth 35, 35 and votes 3930, 3402. Most voted (8) is chosen on call file.
SRR6152708	Rv1911c	1.0	26	Multiple possible alleles:1, 118 with depth 26, 26 and votes 0, -244. Most voted (1) is chosen on call file.
SRR6152708	Rv2319c	1.0	28	Multiple possible alleles:7, 1 with depth 28, 30 and votes 101, 0. Most voted (7) is chosen on call file.

In these cases, MentaLiST chooses the most voted allele, but adds a "+" flag to the call in the output:


In [41]:
# A nice trick to find the column number, given a column name:
awk -v RS='\t' '/Rv1319c/{print NR; exit}' SRR6152708.txt


1023

In [42]:
cut -f 1023 SRR6152708.txt


Rv1319c
8+
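
The awk one-liner above treats the locus name as a pattern, so it would also match a longer column name that contains it. As a sketch, assuming the call file is tab-separated with the locus names in its header row, the hypothetical helper locus_call (not part of MentaLiST) looks up the column by exact name and prints the call for every sample in the file:

```shell
# Hypothetical helper (not part of MentaLiST): print "sample<TAB>call"
# for one locus, matching the header name exactly.
# Usage: locus_call CALL_FILE LOCUS
locus_call() {
    awk -F'\t' -v locus="$2" '
        NR == 1 {
            # Find the column whose header equals the requested locus.
            for (i = 1; i <= NF; i++) if ($i == locus) col = i
            if (!col) { print "locus not found: " locus > "/dev/stderr"; exit 1 }
            next
        }
        { print $1 "\t" $col }
    ' "$1"
}
```

For example, locus_call SRR6152708.txt Rv1319c combines the two steps above into a single command and returns a non-zero status if the locus is not in the header.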

Missing locus - coverage below threshold

This happens when some $k$-mers from the locus are found in the sample, but the coverage is below the minimum threshold of 50%.


In [35]:
grep "Not present" SRR6152708.txt.coverage.txt


SRR6152708	Rv1417	0.0	0	Not present; allele 58 is the best covered but below threshold with 188/435 missing kmers.

In this case, MentaLiST outputs a zero (0) in the call file if it did not find any $k$-mers from the locus, and 0? if it found some $k$-mers from the locus but below the threshold.


In [49]:
awk -v RS='\t' '/Rv1417/{print NR}' SRR6152708.txt


1094

In [36]:
cut -f 1094 SRR6152708.txt


Rv1417
0?

Partially covered alleles

This happens when an allele is partially covered (>50% and <100%), and MentaLiST fails to find a fully covered novel allele.


In [37]:
grep Partial SRR6152708.txt.coverage.txt


SRR6152708	Rv0275c	0.986	0	Partially covered allele or novel allele; Best allele 26 has 10/696 missing kmers, and no novel was found. Gaps on positions: (635, 665)
SRR6152708	Rv0581	0.973	0	Partially covered allele or novel allele; Best allele 1 has 5/186 missing kmers, and no novel was found. Gaps on positions: (115, 119)
SRR6152708	Rv0860	0.987	0	Partially covered allele or novel allele; Best allele 331 has 28/2133 missing kmers, and no novel was found. Gaps on positions: (2096, 2133)
SRR6152708	Rv1860	0.9526	0	Partially covered allele or novel allele; Best allele 152 has 45/948 missing kmers, and no novel was found. Gaps on positions: (855, 901)
SRR6152708	Rv1999c	0.9575	0	Partially covered allele or novel allele; Best allele 5 has 55/1293 missing kmers, and no novel was found. Gaps on positions: (1182, 1236)
SRR6152708	Rv2249c	0.998	0	Partially covered allele or novel allele; Best allele 1 has 3/1521 missing kmers, and no novel was found. Gaps on positions: (1, 3)
SRR6152708	Rv3017c	0.775	0	Partially covered allele or novel allele; Best allele 1 has 75/333 missing kmers, and no novel was found. Gaps on positions: (1, 75)
SRR6152708	Rv3394c	0.9614	0	Partially covered allele or novel allele; Best allele 85 has 60/1545 missing kmers, and no novel was found. Gaps on positions: (1462, 1541)
SRR6152708	Rv3795	0.9995	0	Partially covered allele or novel allele; Best allele 1 has 2/3267 missing kmers, and no novel was found. Gaps on positions: (28, 29)

In this case, the call is written in the format x-, where x is the most covered allele found. The flag indicates that MentaLiST cannot tell whether this is a correct call of a partially covered allele or a novel allele that was not detected. Here we show three of the above loci as an example:


In [54]:
awk -v RS='\t' '/Rv0275c/||/Rv0581/||/Rv0860/ {print NR}' SRR6152708.txt


217
467
674

In [55]:
cut -f 217,467,674 SRR6152708.txt


Rv0275c	Rv0581	Rv0860
26-	1-	331-
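
To review all special cases at once, the flagged entries can be pulled out of the call file in long format. This is a sketch, not a MentaLiST feature: the hypothetical helper flagged_calls assumes a tab-separated call file whose first column is the sample name and whose remaining columns each hold one call. It reports every call that is not a plain positive allele number, such as 8+, 0? or 26-. If the call file contains other columns after the loci, those would be reported as well.

```shell
# Hypothetical helper (not part of MentaLiST): list every call that is
# not a plain positive integer, as "sample<TAB>locus<TAB>call".
# Usage: flagged_calls CALL_FILE
flagged_calls() {
    awk -F'\t' '
        # Header row: remember the name of each column after the first.
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
        {
            for (i = 2; i <= NF; i++)
                if ($i !~ /^[1-9][0-9]*$/)
                    print $1 "\t" name[i] "\t" $i
        }
    ' "$1"
}
```

Running flagged_calls SRR6152708.txt | column -t gives a compact list that can be compared against the coverage report.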

Running MentaLiST on multiple samples

You can also run MentaLiST on all samples of a dataset at once. There are two ways of doing this: specifying all samples on the command line (the suggested way), or creating a file that describes your data (which gives you more control).

Let's download some more samples to try it:


In [57]:
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR639/002/SRR6397472/SRR6397472_{1,2}.fastq.gz --no-clobber
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR639/006/SRR6398036/SRR6398036_{1,2}.fastq.gz --no-clobber
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR615/008/SRR6152708/SRR6152708_{1,2}.fastq.gz --no-clobber
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR639/003/SRR6398023/SRR6398023_{1,2}.fastq.gz --no-clobber


File ‘SRR6397472_1.fastq.gz’ already there; not retrieving.

File ‘SRR6397472_2.fastq.gz’ already there; not retrieving.

File ‘SRR6398036_1.fastq.gz’ already there; not retrieving.

File ‘SRR6398036_2.fastq.gz’ already there; not retrieving.

File ‘SRR6152708_1.fastq.gz’ already there; not retrieving.

File ‘SRR6152708_2.fastq.gz’ already there; not retrieving.

File ‘SRR6398023_1.fastq.gz’ already there; not retrieving.

File ‘SRR6398023_2.fastq.gz’ already there; not retrieving.

Using parameters -1 and -2

Using the same -1 and -2 parameters as before, you can specify multiple paired-end samples, and MentaLiST will figure out the sample name for each pair. Combined with Bash wildcards, this makes it easy to run MentaLiST on several samples at once:


In [59]:
mentalist call -o my_dataset_calls1.txt --db mtb_cgmlst.db -1 SRR6*_1.fastq.gz  -2 SRR6*_2.fastq.gz


[ Info: Opening kmer database ... 
[ Info: Finished the JLD load, building alleles list...
[ Info: Decompressing weight list...
[ Info: Building kmer index ...
[ Info: Sample: SRR6152708. Opening fastq file(s) and counting kmers ... 
[ Info: Voting for alleles ... 
[ Info: Calling alleles and novel alleles ...
[ Info: Sample: SRR6397472. Opening fastq file(s) and counting kmers ... 
[ Info: Voting for alleles ... 
[ Info: Calling alleles and novel alleles ...
[ Info: Sample: SRR6398023. Opening fastq file(s) and counting kmers ... 
[ Info: Voting for alleles ... 
[ Info: Calling alleles and novel alleles ...
[ Info: Sample: SRR6398036. Opening fastq file(s) and counting kmers ... 
[ Info: Voting for alleles ... 
[ Info: Calling alleles and novel alleles ...
[ Info: Writing output ...
[ Info: Done.

The result files are the same as for a single sample, but each file combines all samples. All output files have a 'Sample' column to identify the sample.


In [62]:
cut -f1-12 my_dataset_calls1.txt | column -ts $'\t'


Sample      Rv0014c  Rv0015c  Rv0016c  Rv0017c  Rv0018c  Rv0019c  Rv0021c  Rv0022c  Rv0023  Rv0024  Rv0025
SRR6152708  5        2        1        1        2        1        1        1        1       N       1
SRR6397472  1        1        1        1        2        1        1        1        1       1       N
SRR6398023  1        1        1        1        19       1        1        1        1       1       1
SRR6398036  1        1        1        1        19       1        1        1        1       1       1
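
Because the shell expands the two wildcards independently, a missing or misnamed file can leave the -1 and -2 lists with different lengths or in a different order. As a precaution, the pairing can be checked before calling MentaLiST. This is a sketch outside MentaLiST itself: check_pairs is a hypothetical helper that assumes the _1.fastq.gz / _2.fastq.gz naming used in this example, and it only verifies that each forward file has a mate, not the reverse.

```shell
# Hypothetical helper (not part of MentaLiST): verify that every forward
# read file has a matching reverse read file next to it.
# Usage: check_pairs SAMPLE_1.fastq.gz [SAMPLE_1.fastq.gz ...]
check_pairs() (
    status=0
    for fwd in "$@"; do
        # Derive the expected mate name from the forward file name.
        rev="${fwd%_1.fastq.gz}_2.fastq.gz"
        if [ ! -f "$fwd" ] || [ ! -f "$rev" ]; then
            echo "unpaired or missing: $fwd (expected mate $rev)" >&2
            status=1
        fi
    done
    exit "$status"
)
```

A possible use is check_pairs SRR6*_1.fastq.gz && mentalist call -o my_dataset_calls1.txt --db mtb_cgmlst.db -1 SRR6*_1.fastq.gz -2 SRR6*_2.fastq.gz, so that the call only starts when every pair is complete.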

Creating an input file

The input file should be a tab-separated file with two columns: the first has the sample name, and the second has a FASTQ file for the sample. If a sample has multiple files (paired-end reads or other cases), simply add one row per file, with the same sample identifier.

For instance, for this four-sample example dataset, the input file is:


In [64]:
cat my_dataset_samples.txt


SRR6152708	SRR6152708_1.fastq.gz
SRR6152708	SRR6152708_2.fastq.gz
SRR6397472	SRR6397472_1.fastq.gz
SRR6397472	SRR6397472_2.fastq.gz
SRR6398036	SRR6398036_1.fastq.gz
SRR6398036	SRR6398036_2.fastq.gz
SRR6398023	SRR6398023_1.fastq.gz
SRR6398023	SRR6398023_2.fastq.gz
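
A file like this does not have to be written by hand. The sketch below shows one way to generate it from the FASTQ file names. make_sample_table is a hypothetical helper, and deriving the sample name by stripping a _1.fastq.gz or _2.fastq.gz suffix is an assumption that fits the files in this example, not a MentaLiST requirement.

```shell
# Hypothetical helper (not part of MentaLiST): print one
# "sample<TAB>file" row per FASTQ file given on the command line.
# Usage: make_sample_table FASTQ [FASTQ ...] > samples.txt
make_sample_table() (
    for f in "$@"; do
        sample=$(basename "$f")
        sample=${sample%_[12].fastq.gz}   # drop the read-pair suffix
        printf '%s\t%s\n' "$sample" "$f"
    done
)
```

For example, make_sample_table SRR6*_[12].fastq.gz > my_dataset_samples.txt writes the rows in the order in which the shell expands the wildcard, which may differ from the order shown above.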

In [63]:
mentalist call -o my_dataset_calls2.txt --db mtb_cgmlst.db -i my_dataset_samples.txt


[ Info: Opening kmer database ... 
[ Info: Finished the JLD load, building alleles list...
[ Info: Decompressing weight list...
[ Info: Building kmer index ...
[ Info: Sample: SRR6152708. Opening fastq file(s) and counting kmers ... 
[ Info: Voting for alleles ... 
[ Info: Calling alleles and novel alleles ...
[ Info: Sample: SRR6397472. Opening fastq file(s) and counting kmers ... 
[ Info: Voting for alleles ... 
[ Info: Calling alleles and novel alleles ...
[ Info: Sample: SRR6398036. Opening fastq file(s) and counting kmers ... 
[ Info: Voting for alleles ... 
[ Info: Calling alleles and novel alleles ...
[ Info: Sample: SRR6398023. Opening fastq file(s) and counting kmers ... 
[ Info: Voting for alleles ... 
[ Info: Calling alleles and novel alleles ...
[ Info: Writing output ...
[ Info: Done.

The results should be exactly the same with both input methods; only the order of the rows may differ, since it follows the order in which the samples were given.


In [65]:
cut -f1-12 my_dataset_calls2.txt | column -ts $'\t'


Sample      Rv0014c  Rv0015c  Rv0016c  Rv0017c  Rv0018c  Rv0019c  Rv0021c  Rv0022c  Rv0023  Rv0024  Rv0025
SRR6152708  5        2        1        1        2        1        1        1        1       N       1
SRR6397472  1        1        1        1        2        1        1        1        1       1       N
SRR6398036  1        1        1        1        19       1        1        1        1       1       1
SRR6398023  1        1        1        1        19       1        1        1        1       1       1
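
Since the two tables list the samples in a different order, a plain diff of the two files would report differences. A quick check that ignores row order is sketched below. same_calls is a hypothetical helper, and sorting the lines before comparing is a choice made here, not something MentaLiST provides.

```shell
# Hypothetical helper (not part of MentaLiST): succeed when both files
# are readable and contain the same lines, in any order.
# Usage: same_calls FILE_A FILE_B
same_calls() {
    [ -r "$1" ] && [ -r "$2" ] && [ "$(sort "$1")" = "$(sort "$2")" ]
}
```

It could be used as same_calls my_dataset_calls1.txt my_dataset_calls2.txt && echo "identical calls".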
